First, we will read in the corpus and metadata to create a table. This table will represent each text as one row, and will have multiple columns including the filename, discipline, journal title, year of publication, article title, author names, and complete text of the document.
#Load relevant R packages
library(tidyverse)
library(quanteda)
library(MASS)
#Load in corpus metadata to a table
metadata <- readxl::read_excel("corpus_metadata.xlsx")
#Create a function to read in all of the texts into a table
readtext_lite <- function(paths) {
# Get a vector of the file basenames
doc_ids <- basename(paths)
# Create a vector collapsing each text file into one element in a character vector
texts <- vapply(paths, function(i) paste(readLines(i), collapse = "\n"),
FUN.VALUE = character(1))
text_df <- data.frame(doc_id = doc_ids, text = texts, stringsAsFactors = FALSE)
return(text_df)
}
doc_df <- readtext_lite(metadata$Filename)
#Create a corpus from the data frame and attach metadata
full_corpus <- corpus(doc_df, docid_field="doc_id", text_field="text")
docvars(full_corpus) <- metadata

Next, we create normalized counts of function words in each text. The list of function words used here is drawn from Longman’s Grammar (1999).
#Count tokens in each text
corpus_tokens <- tokens(full_corpus, include_docvars=TRUE, remove_punct = TRUE,
remove_numbers = FALSE, remove_symbols = TRUE, what = "word")
#To account for phrasal function words, combine phrases together with underscores
multiword_expressions <- readLines("phrasal_fxn_words.txt")
corpus_tokens <- tokens_compound(corpus_tokens, pattern = phrase(multiword_expressions))
rawtokens_dfm <- dfm(corpus_tokens)
#Create vector of function words
fxn_words <- readLines("fxn_words.txt")
#Select only function words
fxnword_rawtokens_dfm <- dfm_select(rawtokens_dfm, pattern = fxn_words, selection = "keep")
#Normalize token counts per 10,000 words
fxnword_normalized_dfm <- 10000*(dfm_weight(fxnword_rawtokens_dfm, scheme = "prop"))

Here is a small sample of what we have created. Below is a table with a text on each row, and the normalized counts (per 10,000 words) of a few function words.
Document-feature matrix of: 5 documents, 4 features (0.0% sparse).
5 x 4 sparse Matrix of class "dfm"
features
docs are in a of
cs_cvpr_2017_1.txt 157.3346 421.1013 578.4359 527.5335
cs_cvpr_2017_2.txt 183.4352 683.7132 311.2840 767.0928
cs_cvpr_2017_3.txt 183.5915 499.1394 395.8692 510.6139
cs_cvpr_2017_4.txt 329.0247 458.2844 511.1633 652.1739
cs_cvpr_2017_5.txt 383.6572 662.6806 443.4479 642.7504
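The proportional weighting used above is simple arithmetic: each raw count is divided by its document's total token count and scaled by 10,000. A minimal base-R sketch of the same calculation, using made-up toy counts rather than the real corpus:

```r
# Toy raw counts: 2 documents x 3 function words (made-up values)
raw_counts <- matrix(c(5, 10, 35,
                       2,  8, 30),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("doc1", "doc2"), c("the", "of", "a")))

# Normalize each row to a rate per 10,000 tokens,
# mirroring 10000 * dfm_weight(dfm, scheme = "prop")
normalized <- 10000 * raw_counts / rowSums(raw_counts)

normalized["doc1", "the"]  # 5 / 50 * 10000 = 1000
```

After this weighting every row sums to 10,000, so documents of different lengths become directly comparable.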
Before jumping into more complicated stylometry analyses, we’re first going to do some exploratory data analysis. In this section, we’ll perform some of the more common text analyses to start getting a feel for how our data are distributed and what kinds of trends and patterns show up.
All of the texts in this corpus are academic articles, so we might expect them to be reasonably similar in terms of length. When plotting the lengths, we find that most tend to center around 10,000 words, but there are some very long outliers.
tokencount <- ntoken(rawtokens_dfm)
#Calculate binwidth with the Freedman–Diaconis rule
binwidth_tokens <- 2 * IQR(tokencount) / length(tokencount)^(1 / 3)
#create a dataframe with token counts and disciplines
fxnword_normalized_df <- convert(fxnword_normalized_dfm, to = "data.frame")
fxnword_normalized_df$wordcount <- tokencount
fxnword_normalized_df$discipline <- metadata$Discipline
#plot histogram of lengths
ggplot(fxnword_normalized_df, aes(x = wordcount)) + geom_histogram(binwidth = binwidth_tokens) + ylab("Number of papers") + xlab("Word count")

Upon further investigation, the very long outliers appear to be driven by the discipline of physics.
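The Freedman–Diaconis rule used for the histogram sets the binwidth to twice the interquartile range divided by the cube root of the sample size. A quick base-R check of the formula on a dummy vector standing in for the real token counts:

```r
# Dummy token counts standing in for ntoken() output (illustrative only)
tokencount <- 1:100

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
binwidth_tokens <- 2 * IQR(tokencount) / length(tokencount)^(1 / 3)
binwidth_tokens  # 2 * 49.5 / 100^(1/3), roughly 21.33
```

Because the rule is based on the IQR rather than the standard deviation, it is relatively robust to the long-tail outliers we see in this corpus.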
Preliminary data analysis showed that the following function words had large differences in z-scores between disciplines: were, how, they, as. Below are the normalized frequencies of these four words across disciplines, showing differences.
#plot the four words
were_plot <- ggplot(data=fxnword_normalized_df,mapping=aes(x=reorder(discipline,-were, FUN = median), y=were)) + geom_boxplot(color="blue",fill="white") + theme_bw() + theme(text = element_text(size=6), axis.text.x = element_text(angle = 30, hjust = 1)) + xlab("Discipline") + ylab("were")
how_plot <- ggplot(data=fxnword_normalized_df,mapping=aes(x=reorder(discipline,-how, FUN = median), y=how)) + geom_boxplot(color="blue",fill="white")+ theme_bw() + theme(text = element_text(size=6), axis.text.x = element_text(angle = 30, hjust = 1)) + xlab("Discipline") + ylab("how")
they_plot <- ggplot(data=fxnword_normalized_df,mapping=aes(x=reorder(discipline,-they, FUN = median), y=they)) + geom_boxplot(color="blue",fill="white")+ theme_bw() + theme(text = element_text(size=6), axis.text.x = element_text(angle = 30, hjust = 1)) + xlab("Discipline") + ylab("they")
as_plot <- ggplot(data=fxnword_normalized_df,mapping=aes(x=reorder(discipline,-as, FUN = median), y=as)) + geom_boxplot(color="blue",fill="white")+ theme_bw() + theme(text = element_text(size=6), axis.text.x = element_text(angle = 30, hjust = 1)) + xlab("Discipline") + ylab("as")
#arrange into a visual
gridExtra::grid.arrange(were_plot, how_plot, they_plot, as_plot)

The tables below identify the five words that are most key in each discipline (compared to all of the other disciplines combined), along with the effect size for each word. The tables are sorted by keyness value. A high keyness value indicates strong evidence that the word’s frequency differs between the target discipline and the rest, while a high effect size indicates that the difference itself is large. Here, keyness is calculated using log-likelihood and effect size using Hardie’s log ratio.
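Hardie’s log ratio is the base-2 logarithm of the ratio between a word’s relative frequency in the target corpus and in the reference corpus, with zero counts replaced by 0.5 so the logarithm is always defined. A self-contained toy check of the same calculation that the effect_size() function in the code below performs (the counts here are made up):

```r
# Hardie's log ratio: log2 of the relative-frequency ratio,
# with a 0.5 adjustment for zero counts
log_ratio <- function(n_target, n_reference) {
  total_a <- sum(n_target)
  total_b <- sum(n_reference)
  percent_a <- ifelse(n_target == 0, 0.5 / total_a, n_target / total_a)
  percent_b <- ifelse(n_reference == 0, 0.5 / total_b, n_reference / total_b)
  log2(percent_a / percent_b)
}

# A word making up 10 of 20 target tokens vs 5 of 20 reference tokens:
log_ratio(c(10, 10), c(5, 15))  # first element: log2(0.5 / 0.25) = 1
```

An effect size of 1 means the word is twice as frequent in the target; each additional unit doubles the ratio again.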
#create dfm grouped by discipline
keyness_dfm <- dfm_group(rawtokens_dfm, groups="Disc_short")
discipline_vector <- unique(metadata$Disc_short)
#define function for calculating effect size using Hardie's log ratio
effect_size <- function (n_target, n_reference) {
total_a <- sum(n_target)
total_b <- sum(n_reference)
percent_a <- ifelse(n_target == 0, 0.5 / total_a, n_target/total_a)
percent_b <- ifelse(n_reference == 0, 0.5 / total_b, n_reference/total_b)
ratio <- log2(percent_a / percent_b)
return(ratio)
}
#create list to store values in and data frame to print
keywords <- list()
keyword_table <- data.frame()
#loop through disciplines and store values in keywords list
#you can now access any discipline's keyness and effect size values with keywords[['[disc]']]. Ex: keywords[['cs']]
for (i in 1:length(discipline_vector)) {
#get keywords
keywords[[discipline_vector[i]]] <- textstat_keyness(keyness_dfm,
target=discipline_vector[i],
measure = "lr")
#add effect size
keywords[[discipline_vector[i]]] <- keywords[[discipline_vector[i]]] %>%
mutate(., effect = effect_size(n_target, n_reference))
#take away target and reference counts
pr <- dplyr::select(as.data.frame(keywords[[discipline_vector[i]]]), -n_target, -n_reference)
#rename columns
names(pr)[2] <- 'keyness'
names(pr)[3] <- 'p-value'
names(pr)[4] <- 'effect size'
#append to table for printing & save discipline name
keyword_table <- rbind(keyword_table, format(head(pr,n=5), nsmall=3, digits=3))
}
#print tables
library(knitr)
library(kableExtra)
kable(keyword_table, caption = "Keyness and Effect Size") %>%
kable_styling("striped", full_width = F) %>%
pack_rows(discipline_vector[[1]], 1,5) %>%
pack_rows(discipline_vector[[2]], 6,10) %>%
pack_rows(discipline_vector[[3]], 11,15) %>%
pack_rows(discipline_vector[[4]], 16, 20) %>%
pack_rows(discipline_vector[[5]], 21,25) %>%
pack_rows(discipline_vector[[6]], 26, 30) %>%
pack_rows(discipline_vector[[7]], 31,35) %>%
pack_rows(discipline_vector[[8]], 36, 40) %>%
pack_rows(discipline_vector[[9]], 41,45) %>%
pack_rows(discipline_vector[[10]], 46, 50) %>%
pack_rows(discipline_vector[[11]], 51,55) %>%
pack_rows(discipline_vector[[12]], 56, 60) %>%
pack_rows(discipline_vector[[13]], 61, 65)

| | feature | keyness | p-value | effect size |
|---|---|---|---|---|
| cs | ||||
| 1 | algorithm | 8471.886 | 0.000 | 5.011 |
| 2 | s | 8317.811 | 0.000 | 2.842 |
| 3 | v | 6973.859 | 0.000 | 3.443 |
| 4 | 1 | 6705.538 | 0.000 | 1.520 |
| 5 | we | 6579.818 | 0.000 | 1.416 |
| perf | ||||
| 1 | dance | 16781.315 | 0.000 | 7.781 |
| 2 | theatre | 12505.260 | 0.000 | 9.808 |
| 3 | ballet | 6446.625 | 0.000 | 9.350 |
| 4 | dancers | 5313.173 | 0.000 | 9.137 |
| 5 | performance | 5253.940 | 0.000 | 3.008 |
| cheme ||||
| 1 | reaction | 11541.052 | 0.000 | 6.457 |
| 2 | co | 8537.453 | 0.000 | 7.129 |
| 3 | tio2 | 8329.074 | 0.000 | 13.039 |
| 4 | catalyst | 8130.326 | 0.000 | 8.530 |
| 5 | catalysts | 8063.386 | 0.000 | 10.601 |
| ling ||||
| 1 | languages | 6614.868 | 0.000 | 5.202 |
| 2 | speakers | 5895.676 | 0.000 | 5.761 |
| 3 | english | 5514.495 | 0.000 | 3.630 |
| 4 | vowel | 5461.015 | 0.000 | 11.194 |
| 5 | verb | 4875.104 | 0.000 | 7.472 |
| ed ||||
| 1 | teachers | 27742.143 | 0.000 | 7.025 |
| 2 | students | 19570.715 | 0.000 | 5.085 |
| 3 | school | 12642.213 | 0.000 | 4.486 |
| 4 | teacher | 11054.757 | 0.000 | 5.786 |
| 5 | children | 10595.175 | 0.000 | 3.833 |
| bio ||||
| 1 | cells | 40573.610 | 0.000 | 7.962 |
| 2 | et | 12111.695 | 0.000 | 2.647 |
| 3 | al | 11948.197 | 0.000 | 2.622 |
| 4 | cell | 11105.182 | 0.000 | 5.826 |
| 5 | figure | 10242.081 | 0.000 | 2.938 |
| phil ||||
| 1 | that | 13281.927 | 0.000 | 1.211 |
| 2 | is | 12517.445 | 0.000 | 1.172 |
| 3 | it | 7648.853 | 0.000 | 1.465 |
| 4 | if | 6404.190 | 0.000 | 1.977 |
| 5 | belief | 5154.391 | 0.000 | 4.465 |
| phys ||||
| 1 | ¼ | 31406.271 | 0.000 | 9.343 |
| 2 | þ | 12035.933 | 0.000 | 8.039 |
| 3 | energy | 6992.521 | 0.000 | 3.963 |
| 4 | et | 6299.804 | 0.000 | 1.805 |
| 5 | al | 6294.149 | 0.000 | 1.795 |
| lit ||||
| 1 | literary | 8200.181 | 0.000 | 7.032 |
| 2 | his | 8034.616 | 0.000 | 2.454 |
| 3 | narrative | 4757.758 | 0.000 | 4.303 |
| 4 | he | 3630.001 | 0.000 | 1.922 |
| 5 | poem | 3443.847 | 0.000 | 7.202 |
| soc ||||
| 1 | social | 5372.210 | 0.000 | 2.329 |
| 2 | income | 5186.595 | 0.000 | 4.472 |
| 3 | women | 4698.982 | 0.000 | 3.210 |
| 4 | mothers | 4088.584 | 0.000 | 5.055 |
| 5 | racial | 3731.433 | 0.000 | 3.495 |
| stats ||||
| 1 | x | 35224.322 | 0.000 | 3.987 |
| 2 | 1 | 31579.844 | 0.000 | 2.607 |
| 3 | n | 23528.688 | 0.000 | 3.424 |
| 4 | 0 | 13546.241 | 0.000 | 2.869 |
| 5 | k | 10535.447 | 0.000 | 3.220 |
| polsci ||||
| 1 | party | 6912.617 | 0.000 | 4.739 |
| 2 | political | 6527.650 | 0.000 | 2.929 |
| 3 | parties | 5588.803 | 0.000 | 5.511 |
| 4 | democratic | 4692.951 | 0.000 | 5.107 |
| 5 | democracy | 4055.320 | 0.000 | 5.483 |
| hist ||||
| 1 | had | 9317.536 | 0.000 | 2.666 |
| 2 | was | 7121.690 | 0.000 | 1.480 |
| 3 | british | 4781.092 | 0.000 | 4.800 |
| 4 | he | 4091.505 | 0.000 | 1.888 |
| 5 | his | 3851.584 | 0.000 | 1.737 |
In this section, we are interested in how papers cluster together based on the “distances” between them, and what particular words are driving these clusters.
There are many different ways to calculate the distance between two texts. Common distance measures include Euclidean, Burrows’s Delta, Manhattan, Argamon’s Linear Delta, Canberra, Wurzburg (Cosine Delta), Cosine, and Min-Max. The basic idea behind all of these is to represent how similar two texts are, based on how similar their distributions of function words are. Thus, texts with very similar distributions of function words get a lower distance. The methods vary in the precise algorithm used to calculate this distance.
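To illustrate how the measures differ, here is a base-R sketch (not the stylo implementation) of three of them on two toy frequency vectors: Euclidean and Manhattan distance on the raw frequencies, plus Burrows’s Delta, which is the mean absolute difference of the z-scored frequencies. All values below are made up for illustration:

```r
# Toy normalized frequencies for two texts over four function words,
# plus corpus-wide means/sds needed for Delta's z-scores (made-up values)
x <- c(the = 520, of = 310, and = 280, in = 190)
y <- c(the = 480, of = 350, and = 260, in = 210)
corpus_mean <- c(the = 500, of = 330, and = 270, in = 200)
corpus_sd   <- c(the = 40,  of = 20,  and = 10,  in = 10)

euclidean <- sqrt(sum((x - y)^2))  # straight-line distance
manhattan <- sum(abs(x - y))       # sum of absolute differences
# Burrows's Delta: mean absolute difference of z-scores
delta <- mean(abs((x - corpus_mean) / corpus_sd -
                  (y - corpus_mean) / corpus_sd))
```

Note how Delta standardizes each word by its corpus-wide variability first, so a high-frequency word like "the" cannot dominate the distance the way it can for Euclidean or Manhattan.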
Each graph below uses a different distance measurement. Based on these distances, the dendrograms below are created using agglomerative hierarchical clustering. Each node at the right represents one text, color-coded by discipline. It is not important (or feasible) here to see all of the text names, but it is interesting to see which distance algorithm produces the best clustering of papers (in other words, which graph shows the colors most tightly grouped together). It appears that Manhattan and Delta are doing a fairly good job, while Euclidean and Wurzburg seem a little messier.
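The agglomerative clustering itself can be sketched with base R’s hclust(): start from a pairwise distance matrix, repeatedly merge the two closest clusters, and read the dendrogram off the merge history. This toy example uses Manhattan distance and Ward linkage as one common choice; stylo’s own defaults may differ:

```r
set.seed(42)
# Toy frequency matrix: 6 "texts" x 3 "function words" (random values)
freqs <- matrix(rnorm(18, mean = 100, sd = 10), nrow = 6,
                dimnames = list(paste0("text", 1:6), c("the", "of", "a")))

# Pairwise Manhattan distances between texts,
# then bottom-up (agglomerative) clustering
d <- dist(freqs, method = "manhattan")
hc <- hclust(d, method = "ward.D2")

# plot(hc) would draw the dendrogram; the merge history has n - 1 steps
nrow(hc$merge)  # 5 merges for 6 texts
```

Each row of hc$merge records one merge, from the closest pair of texts up to the final join of the last two clusters, which is exactly the bottom-up structure the dendrograms display.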
#create data frame for analysis
stylo_df <- dplyr::select(fxnword_normalized_df, -wordcount, -discipline) %>% data.frame(., row.names = 1)
#run results w various distance measures
stylo_euclidean_results <- stylo::stylo(gui=FALSE, frequencies = stylo_df, mfw.min = 240, mfw.max=240, plot.custom.height = 13, plot.font.size = 1.8, distance.measure='euclidean', display.on.screen = FALSE, write.jpg.file = TRUE)
stylo_delta_results <- stylo::stylo(gui=FALSE, frequencies = stylo_df, mfw.min = 240, mfw.max=240, plot.custom.height = 13, plot.font.size = 1.8, distance.measure='delta', display.on.screen = FALSE, write.jpg.file = TRUE)
stylo_manhattan_results <- stylo::stylo(gui=FALSE, frequencies = stylo_df, mfw.min = 240, mfw.max=240, plot.custom.height = 13, plot.font.size = 1.8, distance.measure='manhattan', display.on.screen = FALSE, write.jpg.file = TRUE)
stylo_argamon_results <- stylo::stylo(gui=FALSE, frequencies = stylo_df, mfw.min = 240, mfw.max=240, plot.custom.height = 13, plot.font.size = 1.8, distance.measure='argamon', display.on.screen = FALSE, write.jpg.file = TRUE)
stylo_canberra_results <- stylo::stylo(gui=FALSE, frequencies = stylo_df, mfw.min = 240, mfw.max=240, plot.custom.height = 13, plot.font.size = 1.8, distance.measure='canberra', display.on.screen = FALSE, write.jpg.file = TRUE)
stylo_wurzburg_results <- stylo::stylo(gui=FALSE, frequencies = stylo_df, mfw.min = 240, mfw.max=240, plot.custom.height = 13, plot.font.size = 1.8, distance.measure='wurzburg', display.on.screen = FALSE, write.jpg.file = TRUE)
stylo_cosine_results <- stylo::stylo(gui=FALSE, frequencies = stylo_df, mfw.min = 240, mfw.max=240, plot.custom.height = 13, plot.font.size = 1.8, distance.measure='cosine', display.on.screen = FALSE, write.jpg.file = TRUE)
stylo_minmax_results <- stylo::stylo(gui=FALSE, frequencies = stylo_df, mfw.min = 240, mfw.max=240, plot.custom.height = 13, plot.font.size = 1.8, distance.measure='minmax', display.on.screen = FALSE, write.jpg.file = TRUE)

In the keyness and effect size tables, it became apparent that pronouns play a large role in disciplinary discourse. Because pronoun use might be intuitive to explain in some cases (for example, history may use “he” frequently to discuss male historical figures, who have been historically overrepresented in comparison to women), it may be interesting to explore disciplinary differences when pronouns are not included in the function words analyzed. The graphs below repeat the earlier analysis with pronouns left out.
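One simple way to repeat the analysis without pronouns is to drop the pronoun columns from the frequency table before handing it to stylo. A minimal base-R sketch; the pronoun list and frequency values here are made-up stand-ins, not the ones used in the actual analysis:

```r
# Toy frequency table mixing pronouns and other function words
freq_df <- data.frame(he = c(12, 3), they = c(8, 15),
                      of = c(500, 480), the = c(610, 590))

# Hypothetical pronoun list; the real analysis would use its own
pronouns <- c("he", "she", "they", "we", "i", "you", "it")

# Keep only the non-pronoun columns
freq_df_nopro <- freq_df[, !(names(freq_df) %in% pronouns), drop = FALSE]
names(freq_df_nopro)  # "of" "the"
```

The resulting data frame can be passed to stylo::stylo() exactly as before, so the two runs differ only in the feature set.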
Delta appears to be one of the strongest of the measurements tested, and a substantial body of previous research supports its strength in authorship attribution (note that Manhattan is also especially strong, and it may be worth exploring why). The figures below show cluster analyses using the Delta measurement, both with and without pronouns, for comparison. There are no major differences between the two, though the version without pronouns appears slightly more accurate in grouping disciplines. The clusters themselves also differ somewhat.
Another way to visualize the distances here is through techniques drawn from network analysis, as laid out in Eder (2017). When dendrograms are built from the bottom up, the two nodes with the smallest distance are connected, which disregards any information about which node was second or third closest. This information can also be useful for seeing which texts are most closely related. Delta distances (without pronouns) are visualized below using Gephi (a data visualization program) with the ForceAtlas2 layout, which helps to better visualize clusters. Ultimately, this visualization allows us to more easily see clusters of texts, and which clusters are close to each other.